Adapted from the course material provided by the Institute for Systems Biology, Seattle, Washington, USA.

0. Lodading libraries

Load required libraries.

# 0.1. Load required packages for today
library(RColorBrewer)   # 1. library to access easily multiple colors for plotting
library(Rtsne)          # 2. R implementation of the t-SNE algorithm
library(tictoc)         # 3. library to profile execution time
library(Seurat)         # 4. library to single-cell analysis
## Registered S3 method overwritten by 'R.oo':
##   method        from       
##   throw.default R.methodsS3
library(magrittr)       # 5. library for introducing pipe syntax: https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html
library(dplyr)          # 6. useful to manipulate data frames
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

1. Biomarker discovery

Working on the guided clustering tutorial from Seurat.

1.0. Load data PBMC data

We are analyzing a very commonly used dataset of peripheral blood mononuclear cells (PBMC) freely available from 10X Genomics. The experiment yield 2,700 single cell transcriptomes, sequenced on the Illumina NextSeq 500. Data originally retrieved from here. Originally retrieved from: https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz

Once the tar file is downloaded, you can use the following command to load the data into R:

pbmc.data=Read10X(data.dir="~/scratch/filtered_gene_bc_matrices/hg19/")

We then saved the data into and R object for convenient loading in this course using the command:

save(pbmc.data, file = "~/scratch/pbmc.RData")

So now, let’s load the data and start the exploration!

load("data/pbmc.RData") 

1.1. Initialize a Seurat object

Initialize the Seurat object with the raw (non-normalized data). Keep all genes expressed in >= 3 cells (~0.1% of the data). Keep all cells with at least 200 detected genes.

pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
pbmc
## An object of class Seurat 
## 13714 features across 2700 samples within 1 assay 
## Active assay: RNA (13714 features)

1.2. Quality control

In addition to the number of genes detected, the community typically filters out cells with high percentage of mapping reads to the mitochondrial genome. This is due to low-quality / dying cells often exhibit extensive mitochondrial contamination. Mitochondrial QC metrics are computed with the PercentageFeatureSet function, which calculates the percentage of counts originating from a set of features (mitochondrial genes). All genes starting with MT- as a set of mitochondrial genes

Info: The [[ operator can add columns to object metadata. This is a great place to stash QC stats.

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

Visualize QC metrics as a violin plot.

VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

FeatureScatter is typically used to visualize feature-feature relationships, but can be used for anything calculated by the object, i.e. columns in object metadata, PC scores etc.

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
CombinePlots(plots = list(plot1, plot2))

Let’s get the high quality transcriptomes by filtering out cells that have unique gene counts over 2,500 or less than 200 genes.

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

1.3. Data normalization

We can normalize expression data by a global-scaling normalization method LogNormalize that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result.

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

1.4. Feature selection

Next, we will retrieve genes with high cell-to-cell variation across the data set. In this case we will consider both. Focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.

Let’s use the FindVariableFeatures function that directly models the mean-variance relationship inherent in single-cell data.

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)

# plot variable features with and without labels
plot1=VariableFeaturePlot(pbmc)
LabelPoints(plot = plot1, points = top10, repel = TRUE)
## Warning: Using `as.character()` on a quosure is deprecated as of rlang 0.3.0.
## Please use `as_label()` or `as_name()` instead.
## This warning is displayed once per session.
## When using repel, set xnudge and ynudge to 0 for optimal results
## Warning: Transformation introduced infinite values in continuous x-axis

1.4 Scale the data

all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
## Centering and scaling data matrix

1.5 Perform linear dimensional reduction

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP 
##     FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP 
##     PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW 
##     CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A 
##     MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74 
##     HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB 
##     BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2 
##     CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX 
##     TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA 
##     HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8 
##     PLAC8, BLNK, MALAT1, SMIM14, PLD4, LAT2, IGLL5, P2RX5, SWAP70, FCGR2B 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU 
##     HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1 
##     NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HLA-DPB1, HIST1H2AC, PF4, TCL1A 
##     SDPR, HLA-DPA1, HLA-DRB1, HLA-DQA2, HLA-DRA, PPBP, LINC00926, GNG11, HLA-DRB5, SPARC 
##     GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, GZMB, CLU, TUBB1 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL 
##     AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7 
##     LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6 
## PC_ 5 
## Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2 
##     GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, CCL5 
##     RBP7, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, TTC38, AKR1C3, XCL1, HOPX 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1 
##     PTGES3, LILRB2, MAL, CD27, HN1, CD2, GDI2, ANXA5, CORO1B, TUBA1B 
##     FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, FYB

Visualize PC loadings.

VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")

Visualize cells in the first two PCs.

DimPlot(pbmc, reduction = "pca")

Heatmap of expression for the 15 genes with stronger loadings.

DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)

Iterate heatmap visualization across 10 components.

DimHeatmap(pbmc, dims = 1:9, cells = 500, balanced = TRUE)

1.6. Determine the ‘dimensionality’ of the dataset

1.6.1. Elbow approach

ElbowPlot(pbmc)

1.7. Cluster cells

pbmc <- FindNeighbors(pbmc, dims = 1:10)
## Computing nearest neighbor graph
## Computing SNN
pbmc <- FindClusters(pbmc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2638
## Number of edges: 96033
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8720
## Number of communities: 9
## Elapsed time: 0 seconds

Visualize clusters

We can use UMAP to perform dimensionality reduction.

# if RunUMAP fails, most likely it is because UMAP is missing. Run the command line "pip install umap-learn" in a your termina and try again
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap")

# alternatively, we can use t-SNE
#pbmc=RunTSNE(pbmc,dims=1:25)
#TSNEPlot(pbmc) # add file='figure.pdf' if you have some Quartz related errors.

1.8. Find cluster biomarkers

Find all markers of cluster 1.

cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)
##             p_val avg_logFC pct.1 pct.2    p_val_adj
## IL32 1.894810e-92 0.8373872 0.948 0.464 2.598542e-88
## LTB  7.953303e-89 0.8921170 0.981 0.642 1.090716e-84
## CD3D 1.655937e-70 0.6436286 0.919 0.431 2.270951e-66
## IL7R 3.688893e-68 0.8147082 0.747 0.325 5.058947e-64
## LDHB 2.292819e-67 0.6253110 0.950 0.613 3.144372e-63

Find all markers distinguishing cluster 5 from clusters 0 and 3

cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
##                       p_val avg_logFC pct.1 pct.2     p_val_adj
## FCGR3A        7.583625e-209  2.963144 0.975 0.037 1.040018e-204
## IFITM3        2.500844e-199  2.698187 0.975 0.046 3.429657e-195
## CFD           1.763722e-195  2.362381 0.938 0.037 2.418768e-191
## CD68          4.612171e-192  2.087366 0.926 0.036 6.325132e-188
## RP11-290F20.3 1.846215e-188  1.886288 0.840 0.016 2.531900e-184

Find markers for every cluster compared to all remaining cells, report only the positive ones

pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
## # A tibble: 18 x 7
## # Groups:   cluster [9]
##        p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene    
##        <dbl>     <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>   
##  1 1.96e-107     0.730 0.901 0.594 2.69e-103 0       LDHB    
##  2 1.61e- 82     0.922 0.436 0.11  2.20e- 78 0       CCR7    
##  3 7.95e- 89     0.892 0.981 0.642 1.09e- 84 1       LTB     
##  4 1.85e- 60     0.859 0.422 0.11  2.54e- 56 1       AQP3    
##  5 0.            3.86  0.996 0.215 0.        2       S100A9  
##  6 0.            3.80  0.975 0.121 0.        2       S100A8  
##  7 0.            2.99  0.936 0.041 0.        3       CD79A   
##  8 9.48e-271     2.49  0.622 0.022 1.30e-266 3       TCL1A   
##  9 2.96e-189     2.12  0.985 0.24  4.06e-185 4       CCL5    
## 10 2.57e-158     2.05  0.587 0.059 3.52e-154 4       GZMK    
## 11 3.51e-184     2.30  0.975 0.134 4.82e-180 5       FCGR3A  
## 12 2.03e-125     2.14  1     0.315 2.78e-121 5       LST1    
## 13 7.95e-269     3.35  0.961 0.068 1.09e-264 6       GZMB    
## 14 3.13e-191     3.69  0.961 0.131 4.30e-187 6       GNLY    
## 15 1.48e-220     2.68  0.812 0.011 2.03e-216 7       FCER1A  
## 16 1.67e- 21     1.99  1     0.513 2.28e- 17 7       HLA-DPB1
## 17 7.73e-200     5.02  1     0.01  1.06e-195 8       PF4     
## 18 3.68e-110     5.94  1     0.024 5.05e-106 8       PPBP

The ROC test returns the ‘classification power’ for any individual marker (ranging from 0 - random, to 1 - perfect).

cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
head(cluster1.markers)
##       myAUC  avg_diff power pct.1 pct.2
## RPS12 0.822 0.5029988 0.644 1.000 0.991
## RPS27 0.822 0.5020359 0.644 0.999 0.992
## RPS6  0.820 0.4673635 0.640 1.000 0.995
## RPL32 0.815 0.4242773 0.630 0.999 0.995
## RPS14 0.807 0.4283480 0.614 1.000 0.994
## RPS25 0.802 0.5203411 0.604 0.997 0.975

Visualize markers expression distributions.

VlnPlot(pbmc, features = c("MS4A1", "CD79A"))

Visualize markers expression distributions using raw counts.

VlnPlot(pbmc, features = c("MS4A1", "CD79A"),slot = "counts", log = TRUE)

Visualize markers expression in the embedded space.

#FeaturePlot(pbmc,features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
FeaturePlot(pbmc,features = c("LDHB", "LTB", "S100A9", "CD79A", "CCL5", "FCGR3A", "GZMB", "FCER1A","PF4"))

Visualize markers as a heatmap

top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()

Using literature we can associate a cell type to identified clusters via gene markers.

new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", 
    "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5,file='figure.final.pdf') + NoLegend()

2.0. Load expression data

This expression data, as the ones you will be working with, they are already pre-processed. No need to filter for mitochondrial counts.

load("data/malignantCells.2k.RData")
str(malignantCellsExpression[1:5,1:5])
## 'data.frame':    5 obs. of  5 variables:
##  $ RPS11 : num  7.89 8.33 7.83 7.6 8.13
##  $ DECR1 : num  0 3.3 0 0 4.14
##  $ RPS18 : num  9.62 9.98 10.3 9.58 10.34
##  $ SUMO1 : num  0.747 5.154 5.179 4.247 4.58
##  $ UQCR11: num  5.6 4.69 5.38 2.55 4.65

2.1. PCA

results=prcomp(malignantCellsExpression)
plot(results$x[,'PC1'],results$x[,'PC2'],main='PCA of malignant cells',pch=19,xlab='PC 1',ylab='PC 2')

Associate patient metadata to plotting colors.

plottingColors=brewer.pal(length(unique(malignantCellsPatientMetadata)),'Dark2')
names(plottingColors)=unique(malignantCellsPatientMetadata)

Inspect variables.

str(malignantCellsPatientMetadata)
##  chr [1:1061] "Mel81" "Mel80" "Mel81" "Mel80" "Mel81" "Mel81" "Mel80" ...
str(unique(malignantCellsPatientMetadata))
##  chr [1:6] "Mel81" "Mel80" "Mel79" "Mel78" "Mel88" "Mel89"
str(plottingColors)
##  Named chr [1:6] "#1B9E77" "#D95F02" "#7570B3" "#E7298A" "#66A61E" ...
##  - attr(*, "names")= chr [1:6] "Mel81" "Mel80" "Mel79" "Mel78" ...
str(plottingColors[malignantCellsPatientMetadata])
##  Named chr [1:1061] "#1B9E77" "#D95F02" "#1B9E77" "#D95F02" "#1B9E77" ...
##  - attr(*, "names")= chr [1:1061] "Mel81" "Mel80" "Mel81" "Mel80" ...

Plot again with patient metadata as different colors.

plot(results$x[,'PC1'],results$x[,'PC2'],main='PCA of malignant cells',col=plottingColors[malignantCellsPatientMetadata],pch=19,xlab='PC 1',ylab='PC 2')
legend('bottomright',legend=unique(malignantCellsPatientMetadata),fill=plottingColors)

2.2. tSNE

results2D=Rtsne(malignantCellsExpression,dims=2,perplexity=50,theta=0)
plot(results2D$Y,main='tSNE of malignant cells, p=50',col=plottingColors[malignantCellsPatientMetadata],pch=19,xlab='tSNE Component 1',ylab='tSNE Component 2')
legend('topright',legend=unique(malignantCellsPatientMetadata),fill=plottingColors)

3.0. Load melanoma data sets

We now look at single-cell RNA-Seq data from malignant and non-malignant melanoma cells.

load("data/malignantCells.2k.RData")
str(malignantCellsExpression[1:5,1:5])
## 'data.frame':    5 obs. of  5 variables:
##  $ RPS11 : num  7.89 8.33 7.83 7.6 8.13
##  $ DECR1 : num  0 3.3 0 0 4.14
##  $ RPS18 : num  9.62 9.98 10.3 9.58 10.34
##  $ SUMO1 : num  0.747 5.154 5.179 4.247 4.58
##  $ UQCR11: num  5.6 4.69 5.38 2.55 4.65
load("data/nonMalignantCells.2k.RData")
str(nonMalignantCellsExpression[1:5,1:5])
## 'data.frame':    5 obs. of  5 variables:
##  $ RPS11       : num  7.86 9.18 9.32 8.16 8.15
##  $ TRAF3IP2-AS1: num  2.15 2.001 2.104 0 0.797
##  $ RPS18       : num  9.51 6.22 0 8.46 9.37
##  $ SUMO1       : num  0 0 7.13 5.5 0
##  $ UQCR11      : num  3.21 0 7.14 2.97 0

3.1. PCA – malignant cells

results=prcomp(malignantCellsExpression)
plot(results$x[,'PC1'],results$x[,'PC2'],main='PCA of malignant cells',pch=19,xlab='PC 1',ylab='PC 2')

plottingColors=brewer.pal(length(unique(malignantCellsPatientMetadata)),'Dark2')
names(plottingColors)=unique(malignantCellsPatientMetadata)

str(malignantCellsPatientMetadata)
##  chr [1:1061] "Mel81" "Mel80" "Mel81" "Mel80" "Mel81" "Mel81" "Mel80" ...
str(unique(malignantCellsPatientMetadata))
##  chr [1:6] "Mel81" "Mel80" "Mel79" "Mel78" "Mel88" "Mel89"
str(plottingColors)
##  Named chr [1:6] "#1B9E77" "#D95F02" "#7570B3" "#E7298A" "#66A61E" ...
##  - attr(*, "names")= chr [1:6] "Mel81" "Mel80" "Mel79" "Mel78" ...
str(plottingColors[malignantCellsPatientMetadata])
##  Named chr [1:1061] "#1B9E77" "#D95F02" "#1B9E77" "#D95F02" "#1B9E77" ...
##  - attr(*, "names")= chr [1:1061] "Mel81" "Mel80" "Mel81" "Mel80" ...
plot(results$x[,'PC1'],results$x[,'PC2'],main='PCA of malignant cells',col=plottingColors[malignantCellsPatientMetadata],pch=19,xlab='PC 1',ylab='PC 2')
legend('bottomright',legend=unique(malignantCellsPatientMetadata),fill=plottingColors)

3.2. tSNE – malignant cells

results2D=Rtsne(malignantCellsExpression,dims=2,perplexity=50,theta=0)
plot(results2D$Y,main='tSNE of malignant cells, p=50',col=plottingColors[malignantCellsPatientMetadata],pch=19,xlab='tSNE Component 1',ylab='tSNE Component 2')
legend('topright',legend=unique(malignantCellsPatientMetadata),fill=plottingColors)

3.3. PCA – non-malignant cells

results.n=prcomp(nonMalignantCellsExpression)
plot(results.n$x[,'PC1'],results.n$x[,'PC2'],main='PCA of non-malignant cells',pch=19,xlab='PC 1',ylab='PC 2')

plottingColors=brewer.pal(length(unique(tumorLabels)),'Dark2')
names(plottingColors)=unique(tumorLabels)

plottingColorsImmune=brewer.pal(length(unique(immuneLabels)),'Dark2')
names(plottingColorsImmune)=unique(immuneLabels)

plot(results.n$x[,'PC1'],results.n$x[,'PC2'],main='PCA of non-malignant cells',col=plottingColorsImmune[immuneLabels],pch=19,xlab='PC 1',ylab='PC 2')
legend('bottomright',legend=unique(immuneLabels),fill=plottingColorsImmune)

3.3. tSNE – non-malignant cells

# cluster by patients
results2D.n=Rtsne(nonMalignantCellsExpression,dims=2,perplexity=50,theta=0)
plot(results2D.n$Y,main='tSNE of non-malignant cells, p=50',col=plottingColors[tumorLabels],pch=19,xlab='tSNE Component 1',ylab='tSNE Component 2')
legend('topright',legend=unique(tumorLabels),fill=plottingColors)

# cluster by cell type
results2D.n=Rtsne(nonMalignantCellsExpression,dims=2,perplexity=50,theta=0)
plot(results2D.n$Y,main='tSNE of non-malignant cells, p=50',col=plottingColorsImmune[immuneLabels],pch=19,xlab='tSNE Component 1',ylab='tSNE Component 2')
legend('topright',legend=unique(immuneLabels),fill=plottingColorsImmune, bty='n')

results2D.n=Rtsne(nonMalignantCellsExpression,dims=2,perplexity=10,theta=0)
plot(results2D.n$Y,main='tSNE of non-malignant cells, p=10',col=plottingColorsImmune[immuneLabels],pch=19,xlab='tSNE Component 1',ylab='tSNE Component 2')
legend('topright',legend=unique(immuneLabels),fill=plottingColorsImmune, bty='n')

results2D.n=Rtsne(nonMalignantCellsExpression,dims=2,perplexity=100,theta=0)
plot(results2D.n$Y,main='tSNE of non-malignant cells, p=100',col=plottingColorsImmune[immuneLabels],pch=19,xlab='tSNE Component 1',ylab='tSNE Component 2')
legend('topright',legend=unique(immuneLabels),fill=plottingColorsImmune, bty='n')

## 3.4. Using Seurat

nmce <- CreateSeuratObject(counts = t(nonMalignantCellsExpression), project = "nmce", min.cells = 3, min.features = 200)

nmce <- AddMetaData(nmce,immuneLabels,col.name="Immune")

FeatureScatter(nmce, feature1 = "nCount_RNA", feature2 = "nFeature_RNA",ylim=c(0,5000))

nmce <- subset(nmce, subset = nFeature_RNA > 300 & nFeature_RNA < 1500)

nmce <- FindVariableFeatures(nmce, selection.method = "vst", nfeatures = 1000)
top10 <- head(VariableFeatures(nmce), 10)

nmce <- NormalizeData(nmce, normalization.method = "LogNormalize", scale.factor = 10000)
nmce <- ScaleData(nmce, features = rownames(nmce))
## Centering and scaling data matrix
nmce.pca <- RunPCA(nmce, features = VariableFeatures(object = nmce))
## PC_ 1 
## Positive:  IL32, CD3D, CST7, CD2, SRGN, NKG7, PRKCH, PRF1, GZMA, PDCD1 
##     IL2RB, TIGIT, ID2, ITM2A, CD8A, CTSD, GZMK, SIRPG, DUSP4, CCL4 
##     HCST, KLRK1, PRDM1, CCL4L2, CCL4L1, TOX, LINC00152, PYHIN1, GIMAP7, CD7 
## Negative:  MTRNR2L2, MTRNR2L8, IRF8, SELL, CCR7, HLA-DQB1, RPL3, HLA-DRA, LTB, CD55 
##     HLA-DQA2, HLA-DQA1, FAM65B, CXCR4, ST6GAL1, LY9, GPR183, CD37, TXNIP, RPLP0 
##     SNX2, EEF1G, MIR4461, RPS21, RPL4, RPL29, PDE4B, RPS13, HLA-DRB5, RPL13A 
## PC_ 2 
## Positive:  CCL5, CD2, CD3D, CD8A, GZMK, CST7, CCL4, NKG7, KLRK1, IL32 
##     GZMA, PRF1, CCL4L2, CCL4L1, DENND2D, CBLB, SYTL3, PTPRCAP, PDCD1, TOX 
##     LCK, FYN, LOC100130231, SIRPG, PRKCH, CD7, IL2RB, PYHIN1, PRDM1, ZAP70 
## Negative:  IRF8, SNX2, GPX1, TUBA1B, IFITM3, NME2, PRDX1, TUBB, PSMB6, ST6GAL1 
##     CD55, EIF5A, ANXA2, HLA-DQB1, TPM4, TAGLN2, APEX1, WARS, SYNGR2, CTSA 
##     GDI2, HLA-DRA, POMP, TMEM123, PFKL, CMTM6, RAB1A, SDCBP, EIF4A1, EIF3L 
## PC_ 3 
## Positive:  HLA-DRB1, HLA-DRB5, HLA-DQA1, HLA-DPA1, HLA-DRA, HLA-DQA2, IRF8, HLA-DQB1, PTPRCAP, RAC2 
##     CD37, HLA-DMA, CXCR4, RGS2, PTPN6, CD53, CD69, UCP2, RPS4Y1, TAGAP 
##     LSP1, INPP5D, CD74, PTK2B, CORO1A, PARP15, NR4A2, PARVG, RHOH, SASH3 
## Negative:  C1orf56, XIST, IFITM3, IL7R, LGALS1, S100A11, ANXA1, ANXA2, S100A10, SERINC5 
##     IFI6, S100A6, VIM, GIMAP4, HSPB1, PBXIP1, TCF7, MIR4461, ITGB1, STOM 
##     CD63, TMEM123, PRMT2, RPS27L, IL32, AES, GIMAP7, ANXA5, WTAP, SAR1A 
## PC_ 4 
## Positive:  HLA-DRB1, HLA-DRB5, HLA-DPA1, CD74, CCL4L2, CCL4L1, HLA-DQA1, CCL4, NKG7, HLA-DRA 
##     CD8A, HLA-DQA2, PRF1, CD63, HAVCR2, KLRK1, GZMA, HLA-DMA, CST7, HLA-DQB1 
##     WARS, MTRNR2L8, GSTP1, MTRNR2L2, APOBEC3G, CCL5, SYNGR2, DUSP4, CTSD, HSPB1 
## Negative:  IL7R, TCF7, SERINC5, CCR7, SPOCK2, DGKA, CD6, TMEM123, LTB, TC2N 
##     ITK, FAM65B, SAMHD1, SELL, ZAP70, GPR183, LDHB, LAT, LEPROTL1, PIK3IP1 
##     CD69, FAM102A, CD48, JUNB, PBXIP1, CYTIP, STAT4, CYTH1, TSC22D3, CD247 
## PC_ 5 
## Positive:  EMP3, CD52, CORO1A, LCP1, ARHGDIB, FAM65B, GBP1, XIST, MIR4461, S100A4 
##     SELPLG, PIM2, C1orf56, HCLS1, IFI6, VIM, GBP5, CCND2, CD37, RHOF 
##     PSMB10, HLA-DRA, CD48, LTB, TRAF3IP3, RAC2, ALOX5AP, AES, GNG2, ANXA1 
## Negative:  TSPYL2, HSPA1A, RGS2, DNAJB1, JUN, DUSP1, MTRNR2L8, DNAJA1, MTRNR2L2, FOS 
##     FKBP5, RPS4Y1, CBLB, HSP90AB1, PRKCH, HSPE1, PIK3IP1, TUBA1A, NFAT5, NFKBIA 
##     CREM, PPP1R15A, ZNF331, ZBTB38, HSPA8, HSPB1, RNF19A, NXF1, HIF1A, TOX
VizDimLoadings(nmce.pca, dims = 1:2, reduction = "pca")

pdf('nm_seurat_pca_immune.pdf')
DimPlot(nmce.pca, reduction = "pca",group.by="Immune")
dev.off()
## quartz_off_screen 
##                 2
nmce.clust <- FindNeighbors(nmce.pca, dims = 1:10)
## Computing nearest neighbor graph
## Computing SNN
nmce.clust <- FindClusters(nmce.clust, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 1042
## Number of edges: 33042
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8205
## Number of communities: 6
## Elapsed time: 0 seconds
nmce.clust <- RunUMAP(nmce.clust, dims = 1:10)
pdf('nm_seurat_umap_immune.pdf')
DimPlot(nmce.clust, reduction = "umap",group.by="Immune")
dev.off()
## quartz_off_screen 
##                 2
nmce.clust <- RunTSNE(nmce.clust, dims = 1:10)
pdf('nm_seurat_tsne_immune.pdf')
DimPlot(nmce.clust, reduction = "tsne",group.by="Immune")
dev.off()
## quartz_off_screen 
##                 2
pdf('nm_seurat_pca.pdf')
DimPlot(nmce.clust, reduction = "pca")
dev.off()
## quartz_off_screen 
##                 2
cluster0.markers <- FindMarkers(nmce.clust, ident.1 = 0, min.pct = 0.25)
cluster1.markers <- FindMarkers(nmce.clust, ident.1 = 1, min.pct = 0.25)
cluster2.markers <- FindMarkers(nmce.clust, ident.1 = 2, min.pct = 0.25)
cluster3.markers <- FindMarkers(nmce.clust, ident.1 = 3, min.pct = 0.25)
cluster4.markers <- FindMarkers(nmce.clust, ident.1 = 4, min.pct = 0.25)
cluster5.markers <- FindMarkers(nmce.clust, ident.1 = 5, min.pct = 0.25)


head(cluster0.markers, n = 5)
##              p_val avg_logFC pct.1 pct.2     p_val_adj
## NKG7 3.454119e-127  1.669612 0.928 0.141 6.908238e-124
## CD8A 1.491073e-111  1.611743 0.888 0.192 2.982146e-108
## PRF1 3.589568e-108  1.540983 0.862 0.169 7.179136e-105
## CST7 1.705780e-100  1.291854 0.943 0.219  3.411560e-97
## GZMA  6.625120e-98  1.448867 0.842 0.128  1.325024e-94
head(cluster1.markers, n = 5)
##                 p_val avg_logFC pct.1 pct.2    p_val_adj
## IL7R     4.238558e-72  1.363104 0.764 0.228 8.477116e-69
## HLA-DRB5 5.378437e-45 -1.208904 0.193 0.714 1.075687e-41
## HLA-DPA1 4.120055e-41 -1.063219 0.295 0.749 8.240109e-38
## HLA-DRB1 7.702127e-40 -1.012377 0.283 0.761 1.540425e-36
## HLA-DQA1 9.884618e-39 -1.425495 0.087 0.570 1.976924e-35
head(cluster2.markers, n = 5)
##                  p_val avg_logFC pct.1 pct.2     p_val_adj
## IRF8     2.515263e-141 1.9819463 0.906 0.116 5.030527e-138
## HLA-DRA  7.246820e-120 1.3475207 1.000 0.398 1.449364e-116
## CD74     2.683786e-107 0.5661465 1.000 0.902 5.367572e-104
## HLA-DQB1  3.552468e-98 1.3194694 0.944 0.309  7.104936e-95
## HLA-DPA1  4.985555e-98 0.9439924 0.996 0.535  9.971110e-95
head(cluster3.markers, n = 5)
##               p_val avg_logFC pct.1 pct.2    p_val_adj
## ITK    1.036214e-22 0.7150541 0.770 0.337 2.072427e-19
## IL6ST  2.818038e-22 0.7435661 0.791 0.458 5.636075e-19
## TOX    6.642883e-22 0.6785674 0.813 0.364 1.328577e-18
## SH2D1A 8.967137e-21 0.7542328 0.662 0.267 1.793427e-17
## RNF19A 3.346206e-20 0.6156584 0.863 0.512 6.692412e-17
head(cluster4.markers, n = 5)
##               p_val avg_logFC pct.1 pct.2    p_val_adj
## IFITM3 1.083594e-38  1.875376 1.000 0.271 2.167188e-35
## ANXA2  4.700911e-34  1.417533 1.000 0.301 9.401822e-31
## CD63   3.485409e-25  1.173200 0.975 0.332 6.970818e-22
## PTPRC  5.076272e-25 -2.173559 0.125 0.992 1.015254e-21
## CXCR4  3.812010e-23 -2.537301 0.050 0.932 7.624020e-20
head(cluster5.markers, n = 5)
##             p_val avg_logFC pct.1 pct.2    p_val_adj
## GPX1 2.674287e-21 1.4131941 0.963 0.312 5.348575e-18
## PSAP 3.504584e-19 1.0876540 1.000 0.572 7.009167e-16
## CD68 1.135705e-17 1.6452605 1.000 0.547 2.271411e-14
## FTL  5.850214e-17 0.5489939 1.000 0.929 1.170043e-13
## SAT1 1.223098e-16 0.8263940 1.000 0.672 2.446196e-13